Is The Electoral College Misleading?

By: Arun Dhingra and Arjun Kothakota

In [1]:
!pip3 install plotly
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from numpy import exp
import seaborn as sns
import re
import plotly
import plotly.graph_objs as go
import plotly.offline as offline
from plotly.graph_objs import *
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import warnings
warnings.simplefilter('ignore')
In [2]:
init_notebook_mode(connected=False)

Introduction

US presidential elections are a perennial source of controversy. Several candidates have won the presidency without winning the national popular vote. For example, in 2000, Al Gore won the popular vote but lost the election to George W. Bush. The Electoral College debate was reignited in 2016, when Hillary Clinton lost to Donald Trump despite leading the popular vote by more than 2.8 million votes. https://en.wikipedia.org/wiki/List_of_United_States_presidential_elections_in_which_the_winner_lost_the_popular_vote

In the Electoral College, each state is assigned a number of electoral votes equal to its total seats in Congress (House plus Senate), so the allocation tracks population and is updated every decade with the results of the decennial census. For example, California, the most populous state, has 55 electoral votes, while Wyoming has 3. In most states, the electoral votes follow the statewide popular vote: the candidate who wins the most votes in California receives all 55 of its electoral votes. There are two exceptions: Maine and Nebraska each award two electoral votes to the statewide winner and one to the winner of each congressional district.
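
To make the winner-take-all mechanism concrete, here is a minimal sketch on three made-up states. The state names, vote counts, and electoral vote counts are hypothetical and are not taken from our datasets; the point is only to show how a candidate can lose the combined popular vote and still win the electoral count.

```python
# Hypothetical toy data: every number below is invented for illustration.
toy_states = {
    "State A": {"electoral_votes": 10, "votes": {"X": 510_000, "Y": 490_000}},
    "State B": {"electoral_votes": 6, "votes": {"X": 200_000, "Y": 400_000}},
    "State C": {"electoral_votes": 3, "votes": {"X": 80_000, "Y": 70_000}},
}


def winner_take_all(states):
    """Give each state's full electoral vote count to its plurality winner."""
    electoral = {}
    for info in states.values():
        winner = max(info["votes"], key=info["votes"].get)
        electoral[winner] = electoral.get(winner, 0) + info["electoral_votes"]
    return electoral


def popular_totals(states):
    """Sum each candidate's votes across all states."""
    popular = {}
    for info in states.values():
        for cand, n_votes in info["votes"].items():
            popular[cand] = popular.get(cand, 0) + n_votes
    return popular


electoral_result = winner_take_all(toy_states)
popular_result = popular_totals(toy_states)
print(electoral_result)
print(popular_result)
```

In this toy case candidate X narrowly carries two states and collects most of the electoral votes, while candidate Y has more votes overall. The sketch ignores the district-based rules used by Maine and Nebraska.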

One of the main criticisms of the Electoral College is that the ‘winner takes all’ system used by most states seems to disregard everyone in a state who voted for a losing candidate. The origins of the Electoral College have also been linked to voter suppression; in the words of James Madison, “The right of suffrage was much more diffusive in the Northern than the Southern States; and the latter could have no influence in the election on the score of the Negroes.” Additionally, Electoral College results are a persistent source of misinformation. Many people see a state-by-state map of the results and conclude that the winner enjoyed an overwhelming majority; however, maps colored by state do not accurately reflect how people actually voted.

In this notebook, we examine misconceptions about election results and take a critical position against the Electoral College. We first look at what share of the vote each election's winning candidate actually received in each state and how that relates to the Electoral College outcome. We then attempt to predict the outcome of the 2020 presidential election using a model trained on data from previous presidential elections.

In [3]:
%%html
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Spotted: A map to be hung somewhere in the West Wing <a href="https://t.co/TpPPDyNFtE">pic.twitter.com/TpPPDyNFtE</a></p>&mdash; Trey Yingst (@TreyYingst) <a href="https://twitter.com/TreyYingst/status/862669407868391424?ref_src=twsrc%5Etfw">May 11, 2017</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
In [4]:
winners = {
    'democrat': [1976, 1992, 1996, 2008, 2012],
    'republican': [1980, 1984, 1988, 2000, 2004, 2016]
}
parties = ['democrat', 'republican']
In [5]:
pop_cand = pd.read_csv("1976-2016-president.csv")
pop_cand = pop_cand[pop_cand['writein'] == False]
pop_cand = pop_cand.drop(columns=['office','version', 'notes', 'state_cen', 'state_ic', 'writein'], axis=0)
pop_cand = pop_cand.dropna(axis=0)
pop_cand.head(20)
Out[5]:
year state state_po state_fips candidate party candidatevotes totalvotes
0 1976 Alabama AL 1 Carter, Jimmy democrat 659170 1182850
1 1976 Alabama AL 1 Ford, Gerald republican 504070 1182850
2 1976 Alabama AL 1 Maddox, Lester american independent party 9198 1182850
3 1976 Alabama AL 1 Bubar, Benjamin ""Ben"" prohibition 6669 1182850
4 1976 Alabama AL 1 Hall, Gus communist party use 1954 1182850
5 1976 Alabama AL 1 Macbride, Roger libertarian 1481 1182850
7 1976 Alaska AK 2 Ford, Gerald republican 71555 123574
8 1976 Alaska AK 2 Carter, Jimmy democrat 44058 123574
9 1976 Alaska AK 2 Macbride, Roger libertarian 6785 123574
11 1976 Arizona AZ 4 Ford, Gerald republican 418642 742719
12 1976 Arizona AZ 4 Carter, Jimmy democrat 295602 742719
13 1976 Arizona AZ 4 McCarthy, Eugene ""Gene"" independent 19229 742719
14 1976 Arizona AZ 4 Macbride, Roger libertarian 7647 742719
15 1976 Arizona AZ 4 Camejo, Peter socialist workers 928 742719
16 1976 Arizona AZ 4 Anderson, Thomas J. american 564 742719
17 1976 Arizona AZ 4 Maddox, Lester american independent party 85 742719
19 1976 Arkansas AR 5 Carter, Jimmy democrat 498604 767535
20 1976 Arkansas AR 5 Ford, Gerald republican 267903 767535
22 1976 Arkansas AR 5 Maddox, Lester american independent party 389 767535
23 1976 California CA 6 Ford, Gerald republican 3882244 7803770
In [6]:
pop_cand = pop_cand[pop_cand['party'].isin(parties)]

Data Cleaning

We use several datasets in the analysis that follows: election results by state, electoral votes by state over time, poverty by state over time, and the racial makeup of each state over time. Because free data is surprisingly limited for most twentieth-century election years, we restricted ourselves to election and socioeconomic data from the 1976 through 2016 presidential elections.

Our first dataset contains the presidential election results in each state; each candidate is listed with their name, party, and total number of votes. Additionally, there is a column that records the total number of votes cast in each state for each year. Since we are measuring how well candidates are represented across states, this dataset is essential to our analysis.

In [7]:
# useful dicts
fips = {"01": "Alabama",
        "02": "Alaska",
        "04": "Arizona",
        "05": "Arkansas",
        "06": "California",
        "08": "Colorado",
        "09": "Connecticut",
        "10": "Delaware",
        "11": "District of Columbia",
        "12": "Florida",
        "13": "Georgia",
        "15": "Hawaii",
        "16": "Idaho",
        "17": "Illinois",
        "18": "Indiana",
        "19": "Iowa",
        "20": "Kansas",
        "21": "Kentucky",
        "22": "Louisiana",
        "23": "Maine",
        "24": "Maryland",
        "25": "Massachusetts",
        "26": "Michigan",
        "27": "Minnesota",
        "28": "Mississippi",
        "29": "Missouri",
        "30": "Montana",
        "31": "Nebraska",
        "32": "Nevada",
        "33": "New Hampshire",
        "34": "New Jersey",
        "35": "New Mexico",
        "36": "New York",
        "37": "North Carolina",
        "38": "North Dakota",
        "39": "Ohio",
        "40": "Oklahoma",
        "41": "Oregon",
        "42": "Pennsylvania",
        "44": "Rhode Island",
        "45": "South Carolina",
        "46": "South Dakota",
        "47": "Tennessee",
        "48": "Texas",
        "49": "Utah",
        "50": "Vermont",
        "51": "Virginia",
        "53": "Washington",
        "54": "West Virginia",
        "55": "Wisconsin",
        "56": "Wyoming"
       }
us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}
state_region = {
        'AK': 'Other',
        'AL': 'South',
        'AR': 'South',
        'AZ': 'West',
        'CA': 'West',
        'CO': 'West',
        'CT': 'NorthEast',
        'DC': 'NorthEast',
        'DE': 'NorthEast',
        'FL': 'South',
        'GA': 'South',
        'HI': 'Other',
        'IA': 'MidWest',
        'ID': 'West',
        'IL': 'MidWest',
        'IN': 'MidWest',
        'KS': 'MidWest',
        'KY': 'South',
        'LA': 'South',
        'MA': 'NorthEast',
        'MD': 'NorthEast',
        'ME': 'NorthEast',
        'MI': 'MidWest',
        'MN': 'MidWest',
        'MO': 'MidWest',
        'MS': 'South',
        'MT': 'West',
        'NC': 'South',
        'ND': 'MidWest',
        'NE': 'MidWest',
        'NH': 'NorthEast',
        'NJ': 'NorthEast',
        'NM': 'West',
        'NV': 'West',
        'NY': 'NorthEast',
        'OH': 'MidWest',
        'OK': 'South',
        'OR': 'West',
        'PA': 'NorthEast',
        'RI': 'NorthEast',
        'SC': 'South',
        'SD': 'MidWest',
        'TN': 'South',
        'TX': 'South',
        'UT': 'West',
        'VA': 'South',
        'VT': 'NorthEast',
        'WA': 'West',
        'WI': 'MidWest',
        'WV': 'South',
        'WY': 'West'
}
def fips_to_state(fips_id):
    fips_id = fips_id[:2]
    return fips[fips_id] if fips_id in fips else 'NaN'
is_party_winner = {
    'democrat': [1976, 1992, 1996, 2008, 2012],
    'republican': [1980, 1984, 1988, 2000, 2004, 2016]
}
TOTAL_ELEC_VOTES = 538
ELECTION_YEARS = list(np.arange(1976, 2017, 4))

Candidate Votes

In [8]:
cand_votes = pd.read_csv("1976-2016-president.csv")
cand_votes = cand_votes[cand_votes['writein'] == False]
cand_votes = cand_votes.drop(columns=['office','version', 'notes', 'state_cen', 'state_ic', 'writein'], axis=0)
cand_votes = cand_votes.dropna(axis=0)
cand_votes.columns = ['year', 'state', 'state_code', 'state_fips', 'cand', 'party', 'cand_votes', 'tot_votes']
cand_votes['party'] = cand_votes['party'].apply(lambda x: 'democrat' if x == 'democratic-farmer-labor' else x)
cand_votes['region'] = cand_votes['state_code'].apply(lambda x: state_region[x])
cand_votes.head(20)
Out[8]:
year state state_code state_fips cand party cand_votes tot_votes region
0 1976 Alabama AL 1 Carter, Jimmy democrat 659170 1182850 South
1 1976 Alabama AL 1 Ford, Gerald republican 504070 1182850 South
2 1976 Alabama AL 1 Maddox, Lester american independent party 9198 1182850 South
3 1976 Alabama AL 1 Bubar, Benjamin ""Ben"" prohibition 6669 1182850 South
4 1976 Alabama AL 1 Hall, Gus communist party use 1954 1182850 South
5 1976 Alabama AL 1 Macbride, Roger libertarian 1481 1182850 South
7 1976 Alaska AK 2 Ford, Gerald republican 71555 123574 Other
8 1976 Alaska AK 2 Carter, Jimmy democrat 44058 123574 Other
9 1976 Alaska AK 2 Macbride, Roger libertarian 6785 123574 Other
11 1976 Arizona AZ 4 Ford, Gerald republican 418642 742719 West
12 1976 Arizona AZ 4 Carter, Jimmy democrat 295602 742719 West
13 1976 Arizona AZ 4 McCarthy, Eugene ""Gene"" independent 19229 742719 West
14 1976 Arizona AZ 4 Macbride, Roger libertarian 7647 742719 West
15 1976 Arizona AZ 4 Camejo, Peter socialist workers 928 742719 West
16 1976 Arizona AZ 4 Anderson, Thomas J. american 564 742719 West
17 1976 Arizona AZ 4 Maddox, Lester american independent party 85 742719 West
19 1976 Arkansas AR 5 Carter, Jimmy democrat 498604 767535 South
20 1976 Arkansas AR 5 Ford, Gerald republican 267903 767535 South
22 1976 Arkansas AR 5 Maddox, Lester american independent party 389 767535 South
23 1976 California CA 6 Ford, Gerald republican 3882244 7803770 West

Electoral Votes

Because each state's number of electoral votes changes with its population, we needed the electoral vote distribution for each decade covered by our election data. The data came from a graphic that was painstakingly transcribed into an Excel file.

https://www.270towin.com/state-electoral-vote-history/

In [9]:
# So far everything is based on the popular vote; load the electoral vote allocations as well
electoral = pd.read_excel("electoral-dist-1900-2016.xlsx")
In [10]:
state = []
years = []
votes = []
updated_elec_votes = pd.DataFrame()
for index, row in electoral.iterrows():
    for year in ELECTION_YEARS:
        years.append(year)
        # The spreadsheet's first column is labelled 'year' but holds the state name
        state.append(row['year'])
        votes.append(row[year])
updated_elec_votes['state'] = state
updated_elec_votes['year'] = years
updated_elec_votes['votes'] = votes
updated_elec_votes.columns = ['state', 'year', 'elec_votes']
updated_elec_votes
Out[10]:
state year elec_votes
0 Alabama 1976 9
1 Alabama 1980 9
2 Alabama 1984 9
3 Alabama 1988 9
4 Alabama 1992 9
... ... ... ...
556 Wyoming 2000 3
557 Wyoming 2004 3
558 Wyoming 2008 3
559 Wyoming 2012 3
560 Wyoming 2016 3

561 rows × 3 columns
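
The nested loop above reshapes the spreadsheet from one column per election year into one row per state and year. As an alternative sketch, the same wide-to-long reshape can be written with `DataFrame.melt`. The small `wide` frame below is a hypothetical stand-in for the transcribed spreadsheet (we assume, as in the loop above, that its first column is labelled `year` but holds state names); the Alabama and Wyoming counts match the output shown above.

```python
import pandas as pd

# Hypothetical stand-in for the transcribed spreadsheet: the first column is
# labelled 'year' but holds state names; the other columns are election years.
wide = pd.DataFrame({
    "year": ["Alabama", "Wyoming"],
    1976: [9, 3],
    1980: [9, 3],
})

long_votes = (
    wide.rename(columns={"year": "state"})
        # One output row per (state, election year) pair
        .melt(id_vars="state", var_name="year", value_name="elec_votes")
        .astype({"year": int})
        .sort_values(["state", "year"])
        .reset_index(drop=True)
)
print(long_votes)
```

This produces the same three columns (`state`, `year`, `elec_votes`) as the loop, without building intermediate lists by hand.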

Race Data

Collecting and processing the state-level race data took a long time. An intercensal database houses race distributions, but it only reaches back to the 1990s. The 1970s and 1980s data were also structured somewhat differently from the later data: they were keyed on FIPS state and county codes, which identify a county by prefixing its county code with a state code. These datasets therefore needed substantial cleaning to match the more intuitive structure of the intercensal database.

Additionally, we had to merge some race categories. Prior to the 1990s, race was recorded as White, Black, or Other, whereas the later datasets also break out categories such as ‘Native American’ and ‘Pacific Islander’. To keep the datasets consistent, we aggregated those additional categories into Other.

The intercensal data used one column to denote race, meaning that there were six rows (Male/Female for White, Black, and Other) for each year and state pairing. The dataframe was pivoted so that each of these six groups becomes its own column, which makes it more usable for plotting and modelling.
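
The two steps described above (folding the detailed categories into White, Black, and Other, then pivoting to one column per race and gender pair) can be sketched on a toy frame. The counts are invented, and the exact category labels in `to_common` are assumptions about how the source spells them; only the overall shape mirrors our data.

```python
import pandas as pd

# Hypothetical long-format rows; the counts are invented for illustration.
long_race = pd.DataFrame({
    "year": [1992] * 8,
    "state": ["Alabama"] * 8,
    "race": [
        "White", "White",
        "Black or African American", "Black or African American",
        "American Indian or Alaska Native", "American Indian or Alaska Native",
        "Asian or Pacific Islander", "Asian or Pacific Islander",
    ],
    "gender": ["Female", "Male"] * 4,
    "total": [100, 90, 40, 35, 3, 2, 5, 4],
})

# Map the detailed categories onto the three categories shared by all decades.
to_common = {
    "White": "White",
    "Black or African American": "Black",
    "American Indian or Alaska Native": "Other",
    "Asian or Pacific Islander": "Other",
}
long_race["race"] = long_race["race"].map(to_common)

# Sum the merged categories, then spread each race/gender pair into a column.
wide_race = (
    long_race.assign(group=long_race["race"] + long_race["gender"])
             .pivot_table(index=["year", "state"], columns="group",
                          values="total", aggfunc="sum")
             .reset_index()
)
wide_race.columns.name = None
print(wide_race)
```

Using `pivot_table` with `aggfunc="sum"` combines the aggregation and the pivot in one step; the notebook cells below do the same thing with a separate `groupby` followed by `pivot`.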

1970s

https://www.census.gov/data/tables/time-series/demo/popest/1970s-state.html

In [11]:
race1970s = pd.read_excel('race-1970s.xlsx')
race1970s.columns = race1970s.iloc[0]
race1970s = race1970s[1:]
race1970s = race1970s.drop('FIPS State Code', axis = 1)
race1970s = race1970s[(race1970s['Year of Estimate'] == '1976')]
race1970s['gender'] = race1970s['Race/Sex Indicator'].apply(lambda x: 'Female' if 'female' in x else 'Male')
race1970s['Race/Sex Indicator'] = race1970s['Race/Sex Indicator'].apply(lambda x: x.split()[0])
race1970s['Year of Estimate'] = race1970s['Year of Estimate'].apply(lambda x: int(x))
In [12]:
cols = list(race1970s.columns[3:-1])
race1970s['total'] = race1970s[cols].sum(axis=1)
race1970s = race1970s.drop(columns=cols)
race1970s.columns = ['year', 'state', 'race', 'gender', 'total']
In [13]:
race1970s
Out[13]:
year state race gender total
1837 1976 Alabama White Male 1347662.0
1838 1976 Alabama White Female 1420902.0
1839 1976 Alabama Black Male 445145.0
1840 1976 Alabama Black Female 510487.0
1841 1976 Alabama Other Male 5927.0
... ... ... ... ... ...
2138 1976 Wyoming White Female 188885.0
2139 1976 Wyoming Black Male 1741.0
2140 1976 Wyoming Black Female 1349.0
2141 1976 Wyoming Other Male 3697.0
2142 1976 Wyoming Other Female 3793.0

306 rows × 5 columns

In [14]:
race1980s = pd.read_excel("race-1980s.xls")
race1980s.columns = race1980s.iloc[0]
race1980s = race1980s[1:]
race1980s['gender'] = race1980s['Race/Sex Indicator'].apply(lambda x: 'Female' if 'female' in x else 'Male')
race1980s['Race/Sex Indicator'] = race1980s['Race/Sex Indicator'].apply(lambda x: x.split()[0])
In [15]:
cols = list(race1980s.columns[3:-1])
race1980s['total'] = race1980s[cols].sum(axis=1)
race1980s = race1980s.drop(columns=cols)
race1980s['FIPS State and County Codes'] = race1980s.apply(lambda x: fips_to_state(x['FIPS State and County Codes']), axis =1)
race1980s.columns = ['year', 'state', 'race', 'gender', 'total']
In [16]:
race1980s = race1980s[race1980s['state'] != 'NaN']
race1980s = race1980s.groupby(['year', 'state', 'race', 'gender'], as_index=False).agg({'total': 'sum'})
race1980s
Out[16]:
year state race gender total
0 1980 Alabama Black Female 533376.0
1 1980 Alabama Black Male 462447.0
2 1980 Alabama Other Female 9992.0
3 1980 Alabama Other Male 8496.0
4 1980 Alabama White Female 1482088.0
... ... ... ... ... ...
913 1988 Wyoming Black Male 2025.0
914 1988 Wyoming Other Female 6203.0
915 1988 Wyoming Other Male 5883.0
916 1988 Wyoming White Female 223558.0
917 1988 Wyoming White Male 225730.0

918 rows × 5 columns
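
The county-to-state step used for the 1980s data relies on the fact that the first two digits of a county FIPS code identify the state. A minimal sketch of that lookup follows; `state_by_fips` is a three-entry excerpt of the `fips` dictionary defined earlier, and the zero-padding is an added assumption to cover codes that were read as numbers and lost a leading zero.

```python
# Small excerpt of the state lookup; the notebook's `fips` dict has every state.
state_by_fips = {"01": "Alabama", "02": "Alaska", "56": "Wyoming"}


def state_from_county_fips(code):
    """Return the state name for a 5-digit county FIPS code, or None if unknown."""
    # Assumption: restore a leading zero that is lost if the code was parsed as a number.
    code = str(code).zfill(5)
    return state_by_fips.get(code[:2])


print(state_from_county_fips("01001"))  # county code in Alabama
print(state_from_county_fips("56045"))  # county code in Wyoming
print(state_from_county_fips("72001"))  # prefix not in the excerpt
```

Once every county row carries a state name, summing `total` over `year`, `state`, `race`, and `gender` collapses the county-level rows into state-level ones, as in the cell above.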

In [17]:
race1992 = pd.read_excel('race-1990s-present.xls')
race1992 = race1992[race1992['Notes'] != 'Total']
race1992 = race1992.drop(['Notes', 'Yearly July 1st Estimates Code', 'State Code', 'Race Code', 'Gender Code'], axis=1)
race1992 = race1992.dropna()
race1992
Out[17]:
State Gender Race Yearly July 1st Estimates Population
0 Alabama Female American Indian or Alaska Native 1992.0 8701
1 Alabama Female American Indian or Alaska Native 1996.0 10317
2 Alabama Female American Indian or Alaska Native 2000.0 12506
3 Alabama Female American Indian or Alaska Native 2004.0 14341
4 Alabama Female American Indian or Alaska Native 2008.0 16513
... ... ... ... ... ...
3409 Wyoming Male White 2000.0 238348
3410 Wyoming Male White 2004.0 245853
3411 Wyoming Male White 2008.0 264369
3412 Wyoming Male White 2012.0 277279
3413 Wyoming Male White 2016.0 280530

2856 rows × 5 columns

In [18]:
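# Fold the American Indian/Alaska Native and Asian categories into 'Other'
race1992['Race'] = race1992['Race'].apply(lambda x: x if 'Indian' not in x and 'Asian' not in x else 'Other')
# Any remaining label containing 'American' (the Black/African American category) becomes 'Black'
race1992['Race'] = race1992['Race'].apply(lambda x: x if 'American' not in x else 'Black')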
race1992.columns = ['state', 'gender', 'race', 'year', 'total']
race1992 = race1992.groupby(['year', 'state', 'race', 'gender'], as_index = False).agg({'total': 'sum'})
race1992
Out[18]:
year state race gender total
0 1992.0 Alabama Black Female 571306
1 1992.0 Alabama Black Male 489321
2 1992.0 Alabama Other Female 21547
3 1992.0 Alabama Other Male 20160
4 1992.0 Alabama White Female 1567806
... ... ... ... ... ...
2137 2016.0 Wyoming Black Male 5705
2138 2016.0 Wyoming Other Female 12355
2139 2016.0 Wyoming Other Male 12162
2140 2016.0 Wyoming White Female 269349
2141 2016.0 Wyoming White Male 280530

2142 rows × 5 columns

Combining Race Datasets

In [19]:
socioeconomic = pd.concat([race1970s, race1980s, race1992], ignore_index = True)
socioeconomic
Out[19]:
year state race gender total
0 1976.0 Alabama White Male 1347662.0
1 1976.0 Alabama White Female 1420902.0
2 1976.0 Alabama Black Male 445145.0
3 1976.0 Alabama Black Female 510487.0
4 1976.0 Alabama Other Male 5927.0
... ... ... ... ... ...
3361 2016.0 Wyoming Black Male 5705.0
3362 2016.0 Wyoming Other Female 12355.0
3363 2016.0 Wyoming Other Male 12162.0
3364 2016.0 Wyoming White Female 269349.0
3365 2016.0 Wyoming White Male 280530.0

3366 rows × 5 columns

In [20]:
socioeconomic['ragender'] = socioeconomic['race'] + socioeconomic['gender']
socioeconomic = socioeconomic.drop(['race', 'gender'], axis = 1)
In [21]:
socioeconomic = socioeconomic.pivot(columns = ['ragender'], values = ['total'], index = ['year', 'state'])
socioeconomic = socioeconomic.reset_index(level = [0, 1])
socioeconomic.columns = socioeconomic.columns.map(''.join)
socioeconomic.columns = ['year', 'state', 'BlackFemale', 'BlackMale', 'OtherFemale', 'OtherMale', 'WhiteFemale', 'WhiteMale']
socioeconomic
Out[21]:
year state BlackFemale BlackMale OtherFemale OtherMale WhiteFemale WhiteMale
0 1976.0 Alabama 510487.0 445145.0 6915.0 5927.0 1420902.0 1347662.0
1 1976.0 Alaska 5757.0 7413.0 34292.0 35134.0 144438.0 165964.0
2 1976.0 Arizona 32258.0 34327.0 76195.0 72216.0 1080988.0 1051986.0
3 1976.0 Arkansas 192443.0 170722.0 6062.0 5364.0 918985.0 875080.0
4 1976.0 California 845847.0 808672.0 578346.0 570557.0 9701334.0 9429848.0
... ... ... ... ... ... ... ... ...
556 2016.0 Virginia 903868.0 836060.0 342452.0 314307.0 3025940.0 2987479.0
557 2016.0 Washington 175635.0 200788.0 479073.0 440838.0 2992548.0 3005889.0
558 2016.0 West Virginia 34468.0 40188.0 11346.0 10737.0 879595.0 854689.0
559 2016.0 Wisconsin 212928.0 206670.0 128193.0 124037.0 2561241.0 2539559.0
560 2016.0 Wyoming 4114.0 5705.0 12355.0 12162.0 269349.0 280530.0

561 rows × 8 columns

In [22]:
poverty = pd.read_excel("poverty-by-state.xlsx")
poverty.columns = ['state', 'year', 'total', 'poor', 'percent']
poverty = poverty[1:]
poverty['percent'] = poverty['percent'] / 100
poverty.columns = ['state', 'year', 'total_families', 'poor_families', 'percent']
poverty
Out[22]:
state year total_families poor_families percent
1 Alabama 2016 4806 723 0.15
2 Alaska 2016 717 103 0.144
3 Arizona 2016 6990 926 0.132
4 Arkansas 2016 2924 432 0.148
5 California 2016 39237 4872 0.124
... ... ... ... ... ...
557 Virginia 1976 5204 647 0.124
558 Washington 1976 4223 538 0.127
559 West Virginia 1976 1952 297 0.152
560 Wisconsin 1976 4724 403 0.085
561 Wyoming 1976 468 49 0.104

561 rows × 5 columns

Candidate Wins by State per Year

A useful dataframe for our analysis records the winning candidate in each state for each election year (see the map below). The original dataset was aggregated to keep only the candidate who won each state, by selecting the entry with the most votes.

In [23]:
agg_scheme = {
    'party': 'first',
    'cand': 'first',
    'cand_votes': 'max',
    'tot_votes': 'first',
    'state_code': 'first'
}
wins_by_state = cand_votes.groupby(['year', 'state'], as_index=False).agg(agg_scheme)
wins_by_state['party_class'] = wins_by_state['party'].apply(lambda x: 0 if x == 'democrat' else 1)
wins_by_state.columns = ['year', 'state', 'party', 'cand', 'cand_votes', 'tot_votes', 'state_code', 'party_class']
wins_by_state
Out[23]:
year state party cand cand_votes tot_votes state_code party_class
0 1976 Alabama democrat Carter, Jimmy 659170 1182850 AL 0
1 1976 Alaska republican Ford, Gerald 71555 123574 AK 1
2 1976 Arizona republican Ford, Gerald 418642 742719 AZ 1
3 1976 Arkansas democrat Carter, Jimmy 498604 767535 AR 0
4 1976 California republican Ford, Gerald 3882244 7803770 CA 1
... ... ... ... ... ... ... ... ...
556 2016 Virginia democrat Clinton, Hillary 1981473 3982752 VA 0
557 2016 Washington democrat Clinton, Hillary 1742718 3317019 WA 0
558 2016 West Virginia republican Trump, Donald J. 489371 713051 WV 1
559 2016 Wisconsin republican Trump, Donald J. 1405284 2976150 WI 1
560 2016 Wyoming republican Trump, Donald J. 174419 258788 WY 1

561 rows × 8 columns
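
One caveat about the aggregation above: combining `'first'` for the party and candidate with `'max'` for the votes returns the true winner only when the rows for each state are already sorted by descending vote count, which appears to be the case in this file. A sketch of an order-independent alternative uses `idxmax` to select the entire winning row. The toy frame below reuses four 1976 rows from the table shown earlier, deliberately shuffled so that the winner is not always listed first.

```python
import pandas as pd

# Rows taken from the 1976 data shown earlier, deliberately NOT sorted by votes.
toy_votes = pd.DataFrame({
    "year": [1976, 1976, 1976, 1976],
    "state": ["Alaska", "Alaska", "Alabama", "Alabama"],
    "party": ["democrat", "republican", "democrat", "republican"],
    "cand": ["Carter, Jimmy", "Ford, Gerald", "Carter, Jimmy", "Ford, Gerald"],
    "cand_votes": [44058, 71555, 659170, 504070],
})

# Index label of the row with the most votes in each (year, state) group
winner_idx = toy_votes.groupby(["year", "state"])["cand_votes"].idxmax()

# Selecting whole rows keeps party, candidate, and votes consistent with each other
toy_winners = toy_votes.loc[winner_idx].reset_index(drop=True)
print(toy_winners)
```

Here Alaska is correctly attributed to Ford even though Carter's row comes first in the input.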

In [24]:
socioeconomic = pd.concat([race1970s, race1980s, race1992], ignore_index = True)
socioeconomic
Out[24]:
year state race gender total
0 1976.0 Alabama White Male 1347662.0
1 1976.0 Alabama White Female 1420902.0
2 1976.0 Alabama Black Male 445145.0
3 1976.0 Alabama Black Female 510487.0
4 1976.0 Alabama Other Male 5927.0
... ... ... ... ... ...
3361 2016.0 Wyoming Black Male 5705.0
3362 2016.0 Wyoming Other Female 12355.0
3363 2016.0 Wyoming Other Male 12162.0
3364 2016.0 Wyoming White Female 269349.0
3365 2016.0 Wyoming White Male 280530.0

3366 rows × 5 columns

In [25]:
poverty = pd.read_excel("poverty-by-state.xlsx")
poverty.columns = ['state', 'year', 'total', 'poor', 'percent']
poverty = poverty[1:]
poverty['percent'] = poverty['percent'] / 100
poverty
Out[25]:
state year total poor percent
1 Alabama 2016 4806 723 0.15
2 Alaska 2016 717 103 0.144
3 Arizona 2016 6990 926 0.132
4 Arkansas 2016 2924 432 0.148
5 California 2016 39237 4872 0.124
... ... ... ... ... ...
557 Virginia 1976 5204 647 0.124
558 Washington 1976 4223 538 0.127
559 West Virginia 1976 1952 297 0.152
560 Wisconsin 1976 4724 403 0.085
561 Wyoming 1976 468 49 0.104

561 rows × 5 columns

In [26]:
state_voted_for = pop_cand.copy(deep = True)
state_voted_for = state_voted_for.groupby(['year', 'state']).agg({'candidatevotes': 'max', 'party': 'first'}).reset_index().drop('candidatevotes', axis=1)
In [27]:
socioeconomic['ragender'] = socioeconomic['race'] + socioeconomic['gender']
socioeconomic = socioeconomic.drop(['race', 'gender'], axis = 1)
In [28]:
socioeconomic = socioeconomic.pivot(columns = ['ragender'], values = ['total'], index = ['year', 'state'])
socioeconomic = socioeconomic.reset_index(level = [0, 1])
socioeconomic.columns = socioeconomic.columns.map(''.join)
socioeconomic.columns = ['year', 'state', 'BlackFemale', 'BlackMale', 'OtherFemale', 'OtherMale', 'WhiteFemale', 'WhiteMale']
socioeconomic
Out[28]:
year state BlackFemale BlackMale OtherFemale OtherMale WhiteFemale WhiteMale
0 1976.0 Alabama 510487.0 445145.0 6915.0 5927.0 1420902.0 1347662.0
1 1976.0 Alaska 5757.0 7413.0 34292.0 35134.0 144438.0 165964.0
2 1976.0 Arizona 32258.0 34327.0 76195.0 72216.0 1080988.0 1051986.0
3 1976.0 Arkansas 192443.0 170722.0 6062.0 5364.0 918985.0 875080.0
4 1976.0 California 845847.0 808672.0 578346.0 570557.0 9701334.0 9429848.0
... ... ... ... ... ... ... ... ...
556 2016.0 Virginia 903868.0 836060.0 342452.0 314307.0 3025940.0 2987479.0
557 2016.0 Washington 175635.0 200788.0 479073.0 440838.0 2992548.0 3005889.0
558 2016.0 West Virginia 34468.0 40188.0 11346.0 10737.0 879595.0 854689.0
559 2016.0 Wisconsin 212928.0 206670.0 128193.0 124037.0 2561241.0 2539559.0
560 2016.0 Wyoming 4114.0 5705.0 12355.0 12162.0 269349.0 280530.0

561 rows × 8 columns

In [29]:
state_voted_for
Out[29]:
year state party
0 1976 Alabama democrat
1 1976 Alaska republican
2 1976 Arizona republican
3 1976 Arkansas democrat
4 1976 California republican
... ... ... ...
556 2016 Virginia democrat
557 2016 Washington democrat
558 2016 West Virginia republican
559 2016 Wisconsin republican
560 2016 Wyoming republican

561 rows × 3 columns

In [30]:
socioeconomic = socioeconomic.merge(state_voted_for, on = ['year', 'state'])
In [31]:
poverty.columns = ['state', 'year', 'total_families', 'poor_families', 'percent']
socioeconomic = socioeconomic.merge(poverty, on=['state', 'year'])
socioeconomic
Out[31]:
year state BlackFemale BlackMale OtherFemale OtherMale WhiteFemale WhiteMale party total_families poor_families percent
0 1976 Alabama 510487.0 445145.0 6915.0 5927.0 1420902.0 1347662.0 democrat 3831 810 0.212
1 1976 Alaska 5757.0 7413.0 34292.0 35134.0 144438.0 165964.0 republican 379 36 0.096
2 1976 Arizona 32258.0 34327.0 76195.0 72216.0 1080988.0 1051986.0 republican 2774 354 0.128
3 1976 Arkansas 192443.0 170722.0 6062.0 5364.0 918985.0 875080.0 democrat 2249 484 0.215
4 1976 California 845847.0 808672.0 578346.0 570557.0 9701334.0 9429848.0 republican 23748 2619 0.11
... ... ... ... ... ... ... ... ... ... ... ... ...
556 2016 Virginia 903868.0 836060.0 342452.0 314307.0 3025940.0 2987479.0 democrat 8249 847 0.103
557 2016 Washington 175635.0 200788.0 479073.0 440838.0 2992548.0 3005889.0 democrat 7431 736 0.099
558 2016 West Virginia 34468.0 40188.0 11346.0 10737.0 879595.0 854689.0 republican 1794 311 0.173
559 2016 Wisconsin 212928.0 206670.0 128193.0 124037.0 2561241.0 2539559.0 republican 5808 551 0.095
560 2016 Wyoming 4114.0 5705.0 12355.0 12162.0 269349.0 280530.0 republican 560 70 0.124

561 rows × 12 columns

Before we begin...

Many parties exist in the United States, so it is important to know which of them account for most of the votes. The bar graph below shows the total votes received by the largest parties. The Democratic and Republican parties clearly account for the vast majority of candidate votes, so they are the only parties analyzed throughout this project.

In [32]:
parties = cand_votes.groupby(['party']).agg({'cand_votes': 'sum'})
parties.nlargest(20, 'cand_votes').plot(kind='bar', figsize=(20, 10))
Out[32]:
[Figure: bar chart of total candidate votes for the 20 parties with the most votes]
In [33]:
valid_parties = ['democrat', 'republican']
cand_votes = cand_votes[cand_votes['party'].isin(valid_parties)]
wins_by_state = wins_by_state[wins_by_state['party'].isin(valid_parties)]

Role of States in Elections

In [34]:
scl = [[0, '#2980b9'],[1, '#e74c3c']] 
def plot_votes_map(arg):
    data_slider = []
    for year in ELECTION_YEARS:
        df = wins_by_state[(wins_by_state['state'] != 'District of Columbia') & (wins_by_state['year'] == year)]

        by_year = dict(
            type='choropleth',
            locations=df['state_code'],
            z=df[arg].astype(float),
            locationmode='USA-states',
            colorscale=scl,
        )

        data_slider.append(by_year)
    
    steps = []
    count = 1976

    for i in range(len(data_slider)):
        step = dict(method='restyle',
                    args=['visible', [False] * len(data_slider)],
                    label='Year {}'.format(count)) # label to be displayed for each step (year)
        step['args'][1][i] = True
        steps.append(step)
        count += 4

    sliders = [dict(active=10, pad={"t": 1}, steps=steps)]
    
    layout = dict(
        geo=dict(scope='usa', projection={'type': 'albers usa'}),
        sliders=sliders
    )
    
    fig = dict(data=data_slider, layout=layout)
    plotly.offline.iplot(fig, show_link=False)

To analyze the role of states in the Electoral College, it is best to view the election results on our interactive map (use the slider to move between years). Take the 1984 election: the map suggests that Ronald Reagan won in a landslide; however, we will soon see that this could not be further from the truth.

In [35]:
plot_votes_map('party_class')

State Wins vs Election Wins

To analyze the winning candidate’s influence on each state, the original dataset was filtered to contain only the votes for each election's eventual winner in every state (e.g., only Trump's rows for the 2016 election). Next, we calculated a key statistic for the analysis, the win proportion: the votes for the winning candidate divided by the total number of votes cast in that state.

In [36]:
def win_or_lose(row):
    return 'W' if row['year'] in is_party_winner[row['party']] else 'L'
# Show distribution of votes by state
misleading = cand_votes.copy(deep=True)
misleading['outcome'] = misleading.apply(win_or_lose, axis=1)
misleading = misleading[misleading['outcome'] == 'W'].drop('outcome', axis=1)
misleading['win_prop'] = misleading['cand_votes']/misleading['tot_votes']
misleading
Out[36]:
year state state_code state_fips cand party cand_votes tot_votes region win_prop
0 1976 Alabama AL 1 Carter, Jimmy democrat 659170 1182850 South 0.557273
8 1976 Alaska AK 2 Carter, Jimmy democrat 44058 123574 Other 0.356531
12 1976 Arizona AZ 4 Carter, Jimmy democrat 295602 742719 West 0.398000
19 1976 Arkansas AR 5 Carter, Jimmy democrat 498604 767535 South 0.649617
24 1976 California CA 6 Carter, Jimmy democrat 3742284 7803770 West 0.479548
... ... ... ... ... ... ... ... ... ... ...
3705 2016 Virginia VA 51 Trump, Donald J. republican 1769443 3982752 South 0.444276
3711 2016 Washington WA 53 Trump, Donald J. republican 1221747 3317019 West 0.368327
3718 2016 West Virginia WV 54 Trump, Donald J. republican 489371 713051 South 0.686306
3723 2016 Wisconsin WI 55 Trump, Donald J. republican 1405284 2976150 MidWest 0.472182
3732 2016 Wyoming WY 56 Trump, Donald J. republican 174419 258788 West 0.673984

561 rows × 10 columns

Calculating this for every state in each election year shows where the winner’s vote shares are actually centered. A simple scatterplot suggests that, within each election year, the win proportions are roughly normally distributed around 50%, with a few persistent outliers. In other words, most states are more evenly divided than a red-and-blue map implies.

In [37]:
fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(misleading['year'], misleading['win_prop'])
Out[37]:
<matplotlib.collections.PathCollection at 0x7fc502115160>

To better see this relationship, a violin plot of the same data was constructed. The distributions look roughly normal in the earlier years but flatten out in recent years. The effect of the outliers can also be seen.

In [38]:
# Win proportions cluster near the center in earlier years and spread out in later years; outliers remain visible
sns.violinplot(x = misleading['year'], y = misleading['win_prop'])
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc5020ec6a0>

Standardizing the Win Proportions

To attempt to minimize some of this noise, we will standardize the win proportion with respect to each election year. Already, you can see that the standard deviations are larger in later years in comparison to earlier years.

In [39]:
# Standardize win_proportion
wp_standardized = misleading.groupby('year', as_index = False).agg({'year': 'first', 'win_prop': ['mean', 'std']})
wp_standardized.columns = wp_standardized.columns.map('|'.join).str.strip('|')
wp_standardized
Out[39]:
year|first win_prop|mean win_prop|std
0 1976 0.494978 0.079144
1 1980 0.513081 0.090927
2 1984 0.597078 0.087330
3 1988 0.536626 0.079213
4 1992 0.420150 0.084765
5 1996 0.479090 0.084874
6 2000 0.495750 0.105584
7 2004 0.522200 0.104293
8 2008 0.512654 0.110086
9 2012 0.489411 0.117291
10 2016 0.482428 0.120885
In [40]:
# calculate standardized for EACH YEAR
values = []
for index, row in misleading.iterrows():
    election_year = wp_standardized[wp_standardized['year|first'] == row['year']]
    mean = float(election_year['win_prop|mean'])
    std = float(election_year['win_prop|std'])
    calc = (row['win_prop'] - mean)/std
    values.append(calc)
misleading['stand_win_prop'] = values
misleading
Out[40]:
year state state_code state_fips cand party cand_votes tot_votes region win_prop stand_win_prop
0 1976 Alabama AL 1 Carter, Jimmy democrat 659170 1182850 South 0.557273 0.787110
8 1976 Alaska AK 2 Carter, Jimmy democrat 44058 123574 Other 0.356531 -1.749304
12 1976 Arizona AZ 4 Carter, Jimmy democrat 295602 742719 West 0.398000 -1.225340
19 1976 Arkansas AR 5 Carter, Jimmy democrat 498604 767535 South 0.649617 1.953905
24 1976 California CA 6 Carter, Jimmy democrat 3742284 7803770 West 0.479548 -0.194957
... ... ... ... ... ... ... ... ... ... ... ...
3705 2016 Virginia VA 51 Trump, Donald J. republican 1769443 3982752 South 0.444276 -0.315604
3711 2016 Washington WA 53 Trump, Donald J. republican 1221747 3317019 West 0.368327 -0.943885
3718 2016 West Virginia WV 54 Trump, Donald J. republican 489371 713051 South 0.686306 1.686545
3723 2016 Wisconsin WI 55 Trump, Donald J. republican 1405284 2976150 MidWest 0.472182 -0.084761
3732 2016 Wyoming WY 56 Trump, Donald J. republican 174419 258788 West 0.673984 1.584616

561 rows × 11 columns
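
The per-row loop above can also be written as a grouped transform. The following is a minimal sketch on a small made-up frame (the `toy` values are illustrative and not taken from the dataset), assuming the same column names `year` and `win_prop`. pandas' `std` uses the sample standard deviation by default, which matches the `agg` call used above.

```python
import pandas as pd

# Toy data: hypothetical win proportions for three states in two election years
toy = pd.DataFrame({
    'year': [1976, 1976, 1976, 1980, 1980, 1980],
    'win_prop': [0.40, 0.50, 0.60, 0.30, 0.50, 0.70],
})

# Compute each row's z-score relative to the mean and std of its own election year
by_year = toy.groupby('year')['win_prop']
toy['stand_win_prop'] = (toy['win_prop'] - by_year.transform('mean')) / by_year.transform('std')
print(toy)
```

Because `transform` returns a result aligned with the original rows, no explicit loop or lookup into a separate summary table is needed.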

By standardizing, we can easily find which states are the outliers (here, more than two standard deviations from the yearly mean). Unsurprisingly, most of the outliers are the District of Columbia, which votes overwhelmingly Democratic in every election year. Additionally, we can see that Utah mirrors the District of Columbia by consistently favoring the Republican party.

In [41]:
outliers = misleading[(misleading['stand_win_prop'] < -2) | (misleading['stand_win_prop'] > 2)]
outliers
Out[41]:
year state state_code state_fips cand party cand_votes tot_votes region win_prop stand_win_prop
50 1976 District of Columbia DC 11 Carter, Jimmy democrat 137818 168830 NorthEast 0.816312 4.060135
60 1976 Georgia GA 13 Carter, Jimmy democrat 979409 1463152 South 0.669383 2.203649
288 1976 Utah UT 49 Carter, Jimmy democrat 182110 541218 West 0.336482 -2.002634
389 1980 District of Columbia DC 11 Reagan, Ronald republican 23313 173889 NorthEast 0.134068 -4.168306
644 1980 Utah UT 49 Reagan, Ronald republican 439687 604152 West 0.727775 2.361165
746 1984 District of Columbia DC 11 Reagan, Ronald republican 29009 211288 NorthEast 0.137296 -5.264901
1062 1988 District of Columbia DC 11 Bush, George H.W. republican 27590 192877 NorthEast 0.143045 -4.968663
1358 1992 District of Columbia DC 11 Clinton, Bill democrat 192619 227572 NorthEast 0.846409 5.028742
1618 1992 Utah UT 49 Clinton, Bill democrat 183429 743998 West 0.246545 -2.048079
1743 1996 District of Columbia DC 11 Clinton, Bill democrat 158220 185726 NorthEast 0.851900 4.392489
2101 2000 District of Columbia DC 11 Bush, George W. republican 18073 201894 NorthEast 0.089517 -3.847479
2466 2004 District of Columbia DC 11 Bush, George W. republican 21256 227586 NorthEast 0.093398 -4.111518
2785 2008 District of Columbia DC 11 Obama, Barack H. democrat 245800 265853 NorthEast 0.924571 3.741769
3130 2012 District of Columbia DC 11 Obama, Barack H. democrat 267070 293764 NorthEast 0.909131 3.578446
3351 2012 Utah UT 49 Obama, Barack H. democrat 251813 1017440 West 0.247497 -2.062515
3461 2016 District of Columbia DC 11 Trump, Donald J. republican 12723 312575 NorthEast 0.040704 -3.654094
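
As a quick sanity check, the standardized value for the first outlier can be recomputed by hand from the 1976 mean and standard deviation listed in the Out[39] table. This is only a sketch using the rounded values copied from the tables above, so the result agrees with the table to a few decimal places rather than exactly.

```python
# Values copied (rounded) from the tables above
dc_win_prop_1976 = 0.816312   # Carter's share in the District of Columbia
mean_1976 = 0.494978          # mean win proportion across states in 1976
std_1976 = 0.079144           # sample standard deviation across states in 1976

# z-score: how many standard deviations DC sits from the 1976 mean
z = (dc_win_prop_1976 - mean_1976) / std_1976
print(f'{z:.2f}')

# The outlier rule used above flags anything beyond two standard deviations
is_outlier = abs(z) > 2
```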

Overall, the votes don’t seem to sway one way or another. According to the boxplot, the median sits almost exactly in the middle, and the interquartile range appears rather symmetrical.

In [42]:
# In general, there is very little deviation between candidates, and the outliers are relatively consistent
fig, ax = plt.subplots(figsize=(10,5))
ax = sns.boxplot(x=misleading['stand_win_prop'])
ax.axes.set_title("Standardized Win Proportion", fontsize=15)
Out[42]:
Text(0.5, 1.0, 'Standardized Win Proportion')

The new violin plot shows that even when accounting for outliers, there is slightly more deviation between states in recent years than in prior years. It also shows that the distributions for each election year stay quite consistent, and no candidate really swings the majority of America.

In [43]:
fig, ax = plt.subplots(figsize=(10,10))
ax = sns.violinplot(x= misleading['year'], y= misleading['stand_win_prop'])
ax.axes.set_title("Win proportion over Time", fontsize=15)
ax.set_xlabel("Year", fontsize=15)
ax.set_ylabel("Standardized Win Proportion", fontsize=15)
Out[43]:
Text(0, 0.5, 'Standardized Win Proportion')

The boxplot gives another view of the standardized data. It supports the violin plot by showing that no candidate outright has a ‘landslide victory’ across the states, but it also shows that the increase in deviation in recent years is less pronounced than the violin plot suggests. The spread tends to grow around the lower quartile (25%), while the upper quartile (75%) stays relatively consistent.

In [44]:
fig, ax = plt.subplots(figsize=(10,10))
ax = sns.boxplot(x='year', y='stand_win_prop', data=misleading)
ax.axes.set_title("Win proportion over Time", fontsize=15)
ax.set_xlabel("Year", fontsize=15)
ax.set_ylabel("Standardized Win Proportion", fontsize=15)
Out[44]:
Text(0, 0.5, 'Standardized Win Proportion')

States and Parties

Based solely on the popular vote, there does not seem to be much sway between elections. However, when wins are split up by state, we can see where common misconceptions arise. In the next step of the analysis, we can see that even though the popular vote in each election was fairly evenly split, the split of states won is far more lopsided.

In [45]:
# party_class is 1 for a Republican state win and 0 for a Democratic one,
# so the yearly sum counts the states (plus DC) won by the Republican candidate
democrat_year_wins = wins_by_state.groupby('year', as_index=False).agg({'party_class': 'sum'})
democrat_year_wins['dem_avg'] = democrat_year_wins['party_class'] / 51
fig, ax = plt.subplots(figsize=(10,10))
ax.bar(x='year', height='dem_avg', data=democrat_year_wins)
ax.set_xticks(ticks=np.arange(1976, 2017, 4))
ax.set_xlabel("Year", fontsize=15)
ax.set_ylabel("Proportion of States Won by Republicans", fontsize=15)
ax.set_title('Proportion of States Won by Republicans per Election', fontsize=15)
Out[45]:
Text(0.5, 1.0, 'Proportion of States Won by Republicans per Election')
In [46]:
democrat_state_wins = wins_by_state.groupby('state', as_index=False).agg({'party_class': 'sum'})
democrat_state_wins['dem_avg'] = democrat_state_wins['party_class'] / len(ELECTION_YEARS)

Because of this, we can see that, on average, states tended to vote Republican more often than Democrat (a majority of the winners of the 1976-2016 presidential elections were Republicans). However, as we saw before, counting states won is a poor representation of the voting population.

In [47]:
# dem_avg is the share of elections in which a state voted Republican, since party_class is 1 for Republican wins
# Skewed towards Republicans, which makes sense because most election winners in this period were Republican
fig, ax = plt.subplots(figsize=(10, 5))
ax = sns.boxplot(x = democrat_state_wins['dem_avg'])
ax.set_title('Probability of States Voting Republican')
ax.set_xlabel('Probability')
Out[47]:
Text(0.5, 0, 'Probability')

Win Proportion on Electoral Vote Win Proportion

To analyze the effect of the Electoral College on elections, the winning candidate’s vote proportion in each state has been plotted against the number of electoral votes that state holds. As can be seen, most states with many electoral votes are clustered around a 50% win proportion. Thus, much of the skew occurs because all of the losing candidate’s votes in a state essentially disappear in the “winner takes all” system.

In [48]:
# merge electoral votes to previous dataframe
def win(row):
    return row['elec_votes'] if row['year'] in is_party_winner[row['party']] else 0

agg_scheme = {
    'candidatevotes': 'max', 
    'votes': 'first',
    'totalvotes': 'first',
    'party': 'first'
}
popular_electoral = wins_by_state.merge(updated_elec_votes, on=['state', 'year'])
popular_electoral['won_elec_votes'] = popular_electoral.apply(win, axis=1)
popular_electoral['win_prop'] = popular_electoral['cand_votes'] / popular_electoral['tot_votes'] 
popular_electoral['exp_elec_votes'] = popular_electoral['win_prop'] * popular_electoral['elec_votes']
popular_electoral.head(3)
Out[48]:
year state party cand cand_votes tot_votes state_code party_class elec_votes won_elec_votes win_prop exp_elec_votes
0 1976 Alabama democrat Carter, Jimmy 659170 1182850 AL 0 9 9 0.557273 5.015454
1 1976 Alaska republican Ford, Gerald 71555 123574 AK 1 3 0 0.579046 1.737137
2 1976 Arizona republican Ford, Gerald 418642 742719 AZ 1 6 0 0.563661 3.381968
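
To make the difference between the two allocations concrete, the three 1976 rows shown above can be recomputed by hand. This is a minimal sketch in plain Python, assuming the figures copied from the Out[48] table; the names `rows`, `results`, and `NATIONAL_WINNER_PARTY` are illustrative and not part of the notebook.

```python
# Rows copied from the Out[48] table above: (state, party, cand_votes, tot_votes, elec_votes)
rows = [
    ('Alabama', 'democrat', 659170, 1182850, 9),
    ('Alaska', 'republican', 71555, 123574, 3),
    ('Arizona', 'republican', 418642, 742719, 6),
]
NATIONAL_WINNER_PARTY = 'democrat'  # Carter won the 1976 election

results = []
for state, party, cand_votes, tot_votes, elec_votes in rows:
    win_prop = cand_votes / tot_votes
    # Winner-take-all, as counted in won_elec_votes: every elector if the state's
    # winner belongs to the party that won nationally, otherwise none
    won = elec_votes if party == NATIONAL_WINNER_PARTY else 0
    # Proportional ("representative") allocation: electors scaled by the vote share
    expected = win_prop * elec_votes
    results.append((state, win_prop, won, expected))
    print(f'{state}: win_prop={win_prop:.6f}, won={won}, expected={expected:.6f}')
```

In Alabama, for instance, roughly 56% of the vote translates into all 9 electors under winner-take-all but only about 5 under the proportional scheme.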
In [49]:
i = 0
j = 0
fig, ax = plt.subplots(nrows=4, ncols=3)
fig.add_subplot(111, frameon=False)
plt.tick_params(labelcolor='none', top=False, bottom=False, left=False, right=False)

fig.set_figheight(20)
fig.set_figwidth(20)
fig.delaxes(ax[3][2])
plt.xlabel('Win Proportion', fontsize=20)
plt.ylabel('Electoral Votes', fontsize=20)
for year in ELECTION_YEARS:
    election_year = popular_electoral[popular_electoral['year'] == year].reset_index()
    ax[i][j].scatter(election_year['win_prop'], y = election_year['elec_votes'])
    ax[i][j].set_title(f'{year}')
    if i == 3:
        j = j + 1
        i = 0
    else:
        i = i + 1
In [50]:
# aggregate by year
agg_scheme = {
    'won_elec_votes': 'sum', 
    'cand_votes': 'sum', 
    'tot_votes': 'sum', 
    'exp_elec_votes': 'sum'
}
pop_elec_year = popular_electoral.groupby('year', as_index=False).agg(agg_scheme)
pop_elec_year['win_prop'] = pop_elec_year['cand_votes'] / pop_elec_year['tot_votes'] 
pop_elec_year['elec_prop'] = pop_elec_year['won_elec_votes'] / TOTAL_ELEC_VOTES
pop_elec_year = pop_elec_year.drop(['tot_votes', 'cand_votes'], axis=1)
pop_elec_year
Out[50]:
year won_elec_votes exp_elec_votes win_prop elec_prop
0 1976 297 281.816869 0.519522 0.552045
1 1980 448 276.953179 0.511996 0.832714
2 1984 525 318.271524 0.586283 0.975836
3 1988 426 295.449200 0.545360 0.791822
4 1992 370 238.947437 0.442866 0.687732
5 1996 379 273.392584 0.507138 0.704461
6 2000 271 291.738360 0.537609 0.503717
7 2004 286 300.566930 0.552954 0.531599
8 2008 364 306.557956 0.565152 0.676580
9 2012 332 305.926798 0.561814 0.617100
10 2016 305 294.150076 0.541241 0.566914
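
The `elec_prop` column can be checked by hand, and comparing it with `win_prop` shows how much the electoral share departs from the vote share. The sketch below assumes `TOTAL_ELEC_VOTES` is 538 (which is consistent with the `elec_prop` values in the table) and uses three rows copied from Out[50]; the `table` and `elec_share` names are illustrative.

```python
TOTAL_ELEC_VOTES = 538  # assumed value of the notebook constant

# (won_elec_votes, win_prop) copied from the Out[50] table above
table = {
    1984: (525, 0.586283),
    2000: (271, 0.537609),
    2016: (305, 0.541241),
}

elec_share = {}
for year, (won, win_prop) in table.items():
    elec_share[year] = won / TOTAL_ELEC_VOTES
    # A ratio above 1 means the electoral share overstates the vote share
    ratio = elec_share[year] / win_prop
    print(f'{year}: vote share={win_prop:.3f}, electoral share={elec_share[year]:.3f}, ratio={ratio:.2f}')
```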

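Aggregating further by year, the relationship between a candidate’s popular vote share and their electoral vote share looks closer to an exponential curve than a straight line: a small gain in the popular vote corresponds to a much larger gain in electoral votes.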

In [51]:
fig, ax = plt.subplots(figsize=(7,5))
ax.scatter(data=pop_elec_year, x = 'win_prop', y = 'elec_prop')
ax.set_xlabel('Popular Vote Proportion', fontsize=12)
ax.set_ylabel('Electoral Vote Proportion', fontsize=12)
ax.set_title('Popular Vote on Electoral Vote from 1976-2016', fontsize=15)
Out[51]:
Text(0.5, 1.0, 'Popular Vote on Electoral Vote from 1976-2016')

However, by converting each state’s electoral votes in proportion to the popular vote ($E[\text{electoral votes}] = \text{win proportion} \times \text{electoral votes}$) and aggregating them countrywide, there is a more nuanced outcome.

In [52]:
def label(rects):
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                   xy=(rect.get_x() + .4, height),
                   xytext=(0, 3),
                   textcoords='offset points',
                   ha='center', va='bottom')
def tick_size(ticks):
    # label1 is the primary tick label; Tick.label is deprecated in newer Matplotlib versions
    for tick in ticks:
        tick.label1.set_fontsize(14)

pop_elec_year['round_exp'] = pop_elec_year['exp_elec_votes'].apply(lambda x: int(x))
pop_elec_year['round_act'] = pop_elec_year['won_elec_votes'].apply(lambda x: int(x))
pop_elec_year['round_prop'] = pop_elec_year['win_prop'].apply(lambda x: int(x * 538))
fig, ax = plt.subplots(figsize=(18,10))
rep = ax.bar(x=pop_elec_year['year'] - 1, height=pop_elec_year['round_exp'], label='Representative')
act = ax.bar(x=pop_elec_year['year'], height=pop_elec_year['round_act'], label='Actual')
prop = ax.bar(x=pop_elec_year['year'] + 1, height=pop_elec_year['round_prop'], label='Popular Vote (scaled)')

ax.set_xticks(ticks=np.arange(1976, 2017, 4))
label(rep)
label(act)
label(prop)
tick_size(ax.xaxis.get_major_ticks())
tick_size(ax.yaxis.get_major_ticks())
ax.axhline(y=270, ls='--', c='black')

ax.set_xlabel('Year', fontsize=17)
ax.set_ylabel('Electoral Votes', fontsize=14)
ax.set_title('Electoral Votes vs Representative Electoral Votes', fontsize=15)
ax.legend(loc='upper right', fontsize=17)
Out[52]:
<matplotlib.legend.Legend at 0x7fc4ff85ed30>

The Effect of Race on Election Results

In [53]:
socioeconomic['region'] = socioeconomic['state'].apply(lambda x: state_region[us_state_abbrev[x]])
soc_reg = socioeconomic.merge(wins_by_state[['year', 'state', 'state_code', 'party_class']], on=['year', 'state'])
soc_reg = soc_reg.merge(updated_elec_votes, on=['year', 'state'])
soc_reg['total_pop'] = soc_reg['BlackMale'] + soc_reg['BlackFemale'] + soc_reg['WhiteMale'] + soc_reg['WhiteFemale'] + soc_reg['OtherMale'] + soc_reg['OtherFemale']
soc_reg
Out[53]:
year state BlackFemale BlackMale OtherFemale OtherMale WhiteFemale WhiteMale party total_families poor_families percent region state_code party_class elec_votes total_pop
0 1976 Alabama 510487.0 445145.0 6915.0 5927.0 1420902.0 1347662.0 democrat 3831 810 0.212 South AL 0 9 3737038.0
1 1976 Alaska 5757.0 7413.0 34292.0 35134.0 144438.0 165964.0 republican 379 36 0.096 Other AK 1 3 392998.0
2 1976 Arizona 32258.0 34327.0 76195.0 72216.0 1080988.0 1051986.0 republican 2774 354 0.128 West AZ 1 6 2347970.0
3 1976 Arkansas 192443.0 170722.0 6062.0 5364.0 918985.0 875080.0 democrat 2249 484 0.215 South AR 0 6 2168656.0
4 1976 California 845847.0 808672.0 578346.0 570557.0 9701334.0 9429848.0 republican 23748 2619 0.11 West CA 1 45 21934604.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
556 2016 Virginia 903868.0 836060.0 342452.0 314307.0 3025940.0 2987479.0 democrat 8249 847 0.103 South VA 0 13 8410106.0
557 2016 Washington 175635.0 200788.0 479073.0 440838.0 2992548.0 3005889.0 democrat 7431 736 0.099 West WA 0 12 7294771.0
558 2016 West Virginia 34468.0 40188.0 11346.0 10737.0 879595.0 854689.0 republican 1794 311 0.173 South WV 1 5 1831023.0
559 2016 Wisconsin 212928.0 206670.0 128193.0 124037.0 2561241.0 2539559.0 republican 5808 551 0.095 MidWest WI 1 10 5772628.0
560 2016 Wyoming 4114.0 5705.0 12355.0 12162.0 269349.0 280530.0 republican 560 70 0.124 West WY 1 3 584215.0

561 rows × 17 columns

First, looking at the racial makeup of the states won by each party, one can see that in recent years states with larger White and Black populations tended to be won by Republicans. Keep in mind that, since we weren't able to find voter turnout data, these are just the demographics of states that voted for each party, not a record of how each group voted. Since minorities tend to vote Democrat, this may point to low minority turnout or some form of voter suppression, although these data alone cannot establish that.

https://www.pewresearch.org/politics/2018/03/20/1-trends-in-party-affiliation-among-demographic-groups/

In [54]:
def split(group):
    return re.sub(r"(\w)([A-Z])", r"\1 \2", group)
fig, ax = plt.subplots(nrows=3, ncols=2, figsize=(20,15))
cols =[['WhiteFemale', 'WhiteMale'], ['BlackFemale', 'BlackMale'], ['OtherFemale', 'OtherMale']]
i = 0
j = 0
while j < 2:
    plot = sns.violinplot(ax=ax[i][j], x='year', y=cols[i][j], hue='party', data=socioeconomic, split=True)
    plot.set_title(f'Distribution of Votes for {split(cols[i][j])}s')
    plot.set_xlabel('Year')
    if i == 2:
        i = 0
        j = j + 1
    else:
        i = i + 1

States with High Race Densities

In [55]:
race_df = soc_reg
race_df['black'] = race_df['BlackFemale'] + race_df['BlackMale']
race_df['white'] = race_df['WhiteFemale'] + race_df['WhiteMale']
race_df['other'] = race_df['OtherFemale'] + race_df['OtherMale']
race_df['black_std'] = race_df['black'] / race_df['total_pop']
race_df['white_std'] = race_df['white'] / race_df['total_pop']
race_df['other_std'] = race_df['other'] / race_df['total_pop']
race_df = race_df.drop(columns=['total_families', 'poor_families', 'BlackFemale', 'BlackMale', 'WhiteFemale', 'WhiteMale', 'OtherFemale', 'OtherMale'])
for i, row in race_df.iterrows():
    race_df.loc[i, 'state_code'] = us_state_abbrev[row['state']]
In [56]:
race_df.nlargest(10, ['black'])
Out[56]:
year state party percent region state_code party_class elec_votes total_pop black white other black_std white_std other_std
553 2016 Texas republican 0.134 South TX 1 38 27914410.0 3638427.0 22485942.0 1790041.0 0.130342 0.805532 0.064126
542 2016 New York democrat 0.134 NorthEast NY 0 29 19633428.0 3631433.0 13982801.0 2019194.0 0.184962 0.712194 0.102845
519 2016 Florida republican 0.137 South FL 1 29 20613477.0 3598428.0 16235250.0 779799.0 0.174567 0.787604 0.037830
491 2012 New York democrat 0.173 NorthEast NY 0 29 19572932.0 3583150.0 14119939.0 1869843.0 0.183067 0.721401 0.095532
440 2008 New York democrat 0.16 NorthEast NY 0 31 19212436.0 3473867.0 14057786.0 1680783.0 0.180813 0.731702 0.087484
389 2004 New York democrat 0.14 NorthEast NY 0 31 19171567.0 3464246.0 14222791.0 1484530.0 0.180697 0.741869 0.077434
338 2000 New York democrat 0.14 NorthEast NY 0 33 19001780.0 3443462.0 14266524.0 1291794.0 0.181218 0.750799 0.067983
520 2016 Georgia republican 0.133 South GA 1 16 10301890.0 3381419.0 6405733.0 514738.0 0.328233 0.621802 0.049965
468 2012 Florida democrat 0.148 South FL 0 29 19297822.0 3324305.0 15290539.0 682978.0 0.172263 0.792345 0.035391
287 1996 New York democrat 0.167 NorthEast NY 0 33 18588460.0 3314224.0 14195327.0 1078909.0 0.178295 0.763663 0.058042
In [57]:
def plot_race_map(arg, r):
    data_slider = []
    for year in wins_by_state.year.unique():
        df = race_df[(race_df['year'] == year)]
        df = df.nlargest(10, [r])
        df['text'] = df[r]

        by_year = dict(
            type='choropleth',
            locations=df['state_code'],
            z=df[arg].astype(float),
            locationmode='USA-states',
            colorscale=scl,
            text=df['text']
        )

        data_slider.append(by_year)
    
    steps = []
    count = 1976  # first election year shown on the slider

    for i in range(len(data_slider)):
        step = dict(method='restyle',
                    args=['visible', [False] * len(data_slider)],
                    label='Year {}'.format(count) # label to be displayed for each step (year)
                   ) 
        step['args'][1][i] = True
        steps.append(step)
        count += 4

    sliders = [dict(active=10, pad={"t": 1}, steps=steps)]
    
    layout = dict(
        geo=dict(scope='usa', projection={'type': 'albers usa'}),
        sliders=sliders
    )
    
    fig = dict(data=data_slider, layout=layout)
    plotly.offline.iplot(fig, show_link=False)

Here, we can see that the states with the largest proportion of Black residents are in the South and on the East Coast. However, since, as previously mentioned, the majority of Black voters are Democrats, it is odd to see that many of these states are won by Republicans. This could be due to low voter turnout or even voter suppression. https://www.pewresearch.org/politics/2018/03/20/1-trends-in-party-affiliation-among-demographic-groups/

In [58]:
plot_race_map('party_class', 'black_std')

The states with the highest proportion of White residents are in the Midwest and Northeast. Since White voters are generally split between Democrats and Republicans, this makes sense, and the map shows where White Democrats and White Republicans tend to reside. https://www.pewresearch.org/politics/2018/03/20/1-trends-in-party-affiliation-among-demographic-groups/

In [59]:
plot_race_map('party_class', 'white_std')

This shows that the states with the highest proportion of 'Other' residents, a category that includes Hispanic, Pacific Islander, Asian, and Native American people, are on the West Coast and in New York/New Jersey (most likely around NYC). Hispanic and Asian voters usually lean strongly Democratic. https://www.pewresearch.org/politics/2018/03/20/1-trends-in-party-affiliation-among-demographic-groups/

In [60]:
plot_race_map('party_class', 'other_std')

Regression Analysis

Previously, the three-pronged bar graph compared the representative (expected) electoral votes, the actual electoral votes, and the popular vote scaled to 538. It showed that the expected value of electoral votes was not exactly the same as the candidate’s win percentage scaled to 538. This slight difference in how the electoral votes are divided is perplexing, so we plotted each state’s population against its electoral vote count to look for trends among states. We divided these plots into 8-year intervals to see how the mean population of each state relates to that state’s electoral votes.

In [61]:
soc_reg['total_pop'] = soc_reg['black'] + soc_reg['white'] + soc_reg['other']
len(soc_reg.columns)
Out[61]:
23
In [62]:
total_pop = wins_by_state.merge(updated_elec_votes, on=['year', 'state'])
total_pop['total_pop'] = soc_reg['total_pop']
In [63]:
total_pop['period_intervals'] = pd.cut(total_pop['year'], 5)
In [64]:
eight_year_average = pd.DataFrame({
'average_pop': total_pop.groupby(['period_intervals', 'state'])['total_pop'].mean(),
'average_evotes': total_pop.groupby(['period_intervals', 'state'])['elec_votes'].mean()
}).dropna().reset_index()
In [65]:
eight_year_average
Out[65]:
period_intervals state average_pop average_evotes
0 (1975.96, 1984.0] Alabama 3.862825e+06 9.000000
1 (1975.96, 1984.0] Alaska 4.371250e+05 3.000000
2 (1975.96, 1984.0] Arizona 2.716978e+06 6.333333
3 (1975.96, 1984.0] Arkansas 2.259082e+06 6.000000
4 (1975.96, 1984.0] California 2.385727e+07 45.666667
... ... ... ... ...
250 (2008.0, 2016.0] Virginia 8.297593e+06 13.000000
251 (2008.0, 2016.0] Washington 7.095914e+06 12.000000
252 (2008.0, 2016.0] West Virginia 1.843948e+06 5.000000
253 (2008.0, 2016.0] Wisconsin 5.746294e+06 10.000000
254 (2008.0, 2016.0] Wyoming 5.802600e+05 3.000000

255 rows × 4 columns

In [66]:
lst = sorted(list(set(eight_year_average['period_intervals'])))
lst
Out[66]:
[Interval(1975.96, 1984.0, closed='right'),
 Interval(1984.0, 1992.0, closed='right'),
 Interval(1992.0, 2000.0, closed='right'),
 Interval(2000.0, 2008.0, closed='right'),
 Interval(2008.0, 2016.0, closed='right')]
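
The five periods come from `pd.cut(total_pop['year'], 5)`, which splits the range of years into five equal-width, right-closed bins. A minimal sketch using only the eleven election years (rather than the full `total_pop` frame) shows how the years fall into those bins; the `years` and `periods` names are illustrative.

```python
import pandas as pd

# The eleven election years covered by the dataset
years = pd.Series(range(1976, 2017, 4))

# Five equal-width bins over [1976, 2016]; pandas nudges the lowest edge
# slightly below 1976 so that the first year is included
periods = pd.cut(years, 5)

# Number of elections falling in each period, in chronological order
counts = periods.value_counts().sort_index()
print(counts)
```

Because the bins are closed on the right, a boundary year such as 1984 belongs to the earlier period, so the first bin holds three elections (1976, 1980, 1984) while the others hold two each.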
In [67]:
def plot_intervals(interval_num, year_range):
    fig, ax = plt.subplots(figsize=(20, 15))
    interval = eight_year_average[eight_year_average['period_intervals'] == lst[interval_num]]
    
    for i, row in interval.iterrows():
        ax.plot(row['average_pop'], row['average_evotes'], 'o')
        plt.annotate(row['state'], (row['average_pop'], row['average_evotes']))
        
    model = np.polyfit(interval['average_pop'], interval['average_evotes'], 1)
    predict = np.poly1d(model)
    x_lin_reg = list(interval['average_pop'])
    y_lin_reg = predict(x_lin_reg)
    plt.xlabel('Mean Population')
    plt.ylabel('Electoral Votes')
    plt.title(f'Electoral Votes: {year_range}')
    plt.plot(x_lin_reg, y_lin_reg, c='black')
    plt.show()
In [68]:
plot_intervals(0, '1976 - 1984')
In [69]:
plot_intervals(1, '1984 - 1992')
In [70]:
plot_intervals(2, '1992 - 2000')
In [71]:
plot_intervals(3, '2000 - 2008')
In [72]:
plot_intervals(4, '2008 - 2016')

Over the five plots, it is quite evident that the population of a state has an effect on the electoral votes of that state. While California is always at the top with the most electoral votes, the number of electoral votes of some states, such as Texas, increases over time. This is due to the increase in population in those states. New York’s electoral votes, however, decrease slightly over time even though its population remained roughly the same; because the total number of electoral votes is fixed at 538, a state that grows more slowly than the rest of the country loses electoral votes to faster-growing states. To further confirm our assumption that electoral votes increase with the population of a state, we will now perform a hypothesis test.

Hypothesis Testing

To examine the relationship between the number of electoral votes per state and the state's total population, we shall conduct a hypothesis test.

In [73]:
from statsmodels.formula.api import ols
In [74]:
average = pd.DataFrame({
'average_pop': total_pop.groupby(['state'])['total_pop'].mean(),
'average_evotes': total_pop.groupby(['state'])['elec_votes'].mean()
}).dropna().reset_index()

reg = ols(formula='average_evotes ~ average_pop', data=average).fit()
print(reg.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         average_evotes   R-squared:                       0.995
Model:                            OLS   Adj. R-squared:                  0.995
Method:                 Least Squares   F-statistic:                     9181.
Date:                Mon, 21 Dec 2020   Prob (F-statistic):           2.08e-57
Time:                        23:05:43   Log-Likelihood:                -52.034
No. Observations:                  51   AIC:                             108.1
Df Residuals:                      49   BIC:                             111.9
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       2.1301      0.130     16.379      0.000       1.869       2.392
average_pop  1.592e-06   1.66e-08     95.818      0.000    1.56e-06    1.63e-06
==============================================================================
Omnibus:                        9.248   Durbin-Watson:                   1.761
Prob(Omnibus):                  0.010   Jarque-Bera (JB):               18.420
Skew:                           0.299   Prob(JB):                     0.000100
Kurtosis:                       5.883   Cond. No.                     1.06e+07
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.06e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

Using the statsmodels library, we see the regression results for the dataset, with the mean population per state as the x-value and the mean number of electoral votes per state as the y-value. We take $\beta_1$ to be the coefficient of the x-value. We are testing this at a 5% significance level; therefore, $\alpha = 0.05$.

Hypothesis:

$H_0$: $\beta_1 = 0$ (null)

$H_a$: $\beta_1 \neq 0$ (alternate)

Decision Rule:

If the p-value of $\beta_1$ (average_pop) is greater than the significance level, we fail to reject the null hypothesis.

If the p-value of $\beta_1$ (average_pop) is less than the significance level, we reject the null hypothesis.

Test Statistic:

t = 95.818, with p-value < 0.001 (reported as 0.000 in the summary)

We clearly see that p-value $< \alpha$.

Conclusion:

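Since we found the p-value to be less than the significance level, we reject the null hypothesis. Together with an $R^2$ of 0.995, this lets us conclude that there is a strong linear relationship between the population of a state and the electoral votes for that state. The intercept of about 2.13 is also consistent with every state receiving two electoral votes for its senators regardless of population.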
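
To get a feel for the fitted line, the coefficients from the summary can be plugged in by hand. This is an illustrative sketch only: the coefficients are the rounded values printed above, and the population used is Wyoming's 2008-2016 mean from the Out[65] table, whereas the regression itself was fit on averages over all years. The function name `predicted_electoral_votes` is hypothetical.

```python
# Coefficients copied (rounded) from the OLS summary above
INTERCEPT = 2.1301
SLOPE = 1.592e-06  # electoral votes per resident

def predicted_electoral_votes(population):
    """Return the fitted number of electoral votes for a given mean population."""
    return INTERCEPT + SLOPE * population

# Wyoming's 2008-2016 mean population, copied from the Out[65] table
wyoming = predicted_electoral_votes(5.802600e+05)
print(round(wyoming, 2))

# The inverse of the slope: additional residents associated with one more electoral vote
residents_per_vote = 1 / SLOPE
print(round(residents_per_vote))
```

The fitted value for Wyoming comes out just above 3, close to its actual 3 electoral votes, and the slope corresponds to roughly 628,000 residents per additional electoral vote.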

Prediction using Decision Tree Classifier

In this section, we will try to predict the results of the 2020 election and compare them with the actual outcome, in which the Democratic party won. Since the 2020 census data has not yet been made available, we predict on the 2019 census data, which is the closest available representation of the current population. The data has been extracted from

https://www.kff.org/other/state-indicator/distribution-by-raceethnicity/?dataView=1&currentTimeframe=0&sortModel=%7B%22colId%22:%22Location%22,%22sort%22:%22asc%22%7D

We first clean the dataset to count the total population per state for each race. We then train the model on the popular candidate dataset. The party_class column in this dataset represents the outcome of the state (Democrat or Republican); hence, it is used as the y value in the model. The model is trained using state, race, and region as the features, all of which are also available in the 2019 census data. After training the model on the existing data with the outcomes for each state, we predict the 2020 outcome using the 2019 census data.

In the party_class column, 0 denotes a Democratic win and 1 a Republican win, as in the tables above (for example, Alabama in 1976: democrat, 0). The features included are: race population (black, white, other), state (dummy), region (dummy), and total population.

In [75]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
In [76]:
pop_2019 = pd.read_csv('raw_data_2019.csv')
pop_2019 = pop_2019.head(50)
pop_2019 = pop_2019.replace(np.nan, 0)
pop_2019['other'] = pop_2019['Hispanic'] + pop_2019['Asian'] + pop_2019['American Indian/Alaska Native'] + \
                    pop_2019['Native Hawaiian/Other Pacific Islander'] + pop_2019['Multiple Races']
pop_2019 = pop_2019.drop(columns=[
    'Hispanic',
    'Asian',
    'American Indian/Alaska Native',
    'Native Hawaiian/Other Pacific Islander',
    'Multiple Races',
    'Total'
])
pop_2019 = pop_2019.rename(columns={
    'White': 'white',
    'Black': 'black',
    'Location': 'state'
})
pop_2019['region'] = pop_2019['state'].apply(lambda x: state_region[us_state_abbrev[x]])
pop_2019['total_pop'] = pop_2019['white'] + pop_2019['black'] + pop_2019['other']
In [77]:
data = soc_reg[soc_reg['state'] != 'District of Columbia']
data = pd.get_dummies(data, columns=['state', 'region'], drop_first=True)
data['black'] = data['BlackFemale'] + data['BlackMale']
data['white'] = data['WhiteFemale'] + data['WhiteMale']
data['other'] = data['OtherFemale'] + data['OtherMale']
data = data.drop(columns=['black_std', 'white_std', 'other_std', 'year', 'party', 'total_families', 'poor_families', 'BlackFemale', 'BlackMale', 'WhiteFemale', 'WhiteMale', 'OtherFemale', 'OtherMale'])
In [78]:
X = data.drop(columns=['party_class', 'percent', 'state_code', 'elec_votes'])
y = np.array(data["party_class"])
In [79]:
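# One-hot encode the 2019 data the same way as the training data.
# predict() expects X_test to have the same columns, in the same order, as X.
X_test = pd.get_dummies(pop_2019, columns=['state', 'region'], drop_first=True)
# Fit a decision tree on the historical outcomes, then predict each state's class.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
tree_y_predicted = clf.predict(X_test)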
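
Because the training and prediction frames are one-hot encoded separately, their dummy columns can differ in name or order. The sketch below shows one possible way to guard against that by reindexing the new data to the training columns before predicting. The `train` and `new` frames and their values are hypothetical, and this alignment step is an assumption added for illustration, not something the cell above performs.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training rows (invented numbers); 0 = republican, 1 = democrat.
train = pd.DataFrame({
    "state": ["A", "B", "C", "A", "B", "C"],
    "white": [100, 200, 300, 110, 210, 310],
    "black": [10, 50, 30, 12, 55, 33],
    "party_class": [1, 0, 1, 1, 0, 1],
})
# Hypothetical rows to predict; state "C" is deliberately absent.
new = pd.DataFrame({
    "state": ["B", "A"],
    "white": [205, 105],
    "black": [52, 11],
})

X_train = pd.get_dummies(
    train.drop(columns=["party_class"]), columns=["state"], drop_first=True, dtype=int
)
y_train = train["party_class"].to_numpy()

# Encode the new rows WITHOUT drop_first, then keep exactly the training
# columns in the training order. Dummy columns missing from the new data
# are filled with 0; extra ones (here state_A) are discarded.
X_new = pd.get_dummies(new, columns=["state"], dtype=int)
X_new = X_new.reindex(columns=X_train.columns, fill_value=0)

model = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_train, y_train)
pred = model.predict(X_new)
print(pred)
```

Encoding the new rows without `drop_first` avoids the case where a different category is dropped from each frame, which would silently shift the meaning of the dummy columns.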
In [80]:
electoral_votes_2016 = wins_by_state[wins_by_state['year'] == 2016]
electoral_votes_2016 = electoral_votes_2016.reset_index().drop(columns=['index'])
electoral_votes_2016 = electoral_votes_2016.merge(updated_elec_votes, on=['year', 'state'])

With a predicted outcome for each state, we now total the electoral votes for Republicans and Democrats.

In [81]:
republican_votes = 0
democrat_votes = 0

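# Row i of tree_y_predicted is assumed to describe the same state as
# row i of electoral_votes_2016.
for i, predicted_class in enumerate(tree_y_predicted):
    if predicted_class == 0:  # 0 = republican
        republican_votes += electoral_votes_2016.loc[i, 'elec_votes']
    else:  # 1 = democrat
        democrat_votes += electoral_votes_2016.loc[i, 'elec_votes']
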
print(f'Republicans: \t{republican_votes} electoral votes,\nDemocrats: \t{democrat_votes} electoral votes')
Republicans: 	244 electoral votes,
Democrats: 	291 electoral votes
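
The tally above pairs predictions with electoral votes by row position. A possible alternative, sketched below on invented data, is to join the two tables on the state name so the pairing does not depend on row order. The `predictions` and `votes` frames are hypothetical stand-ins, not the notebook's variables.

```python
import pandas as pd

# Hypothetical predictions (0 = republican, 1 = democrat) and an
# electoral-vote table with invented values.
predictions = pd.DataFrame({
    "state": ["A", "B", "C"],
    "party_class": [1, 0, 1],
})
votes = pd.DataFrame({
    "state": ["C", "A", "B"],  # deliberately in a different order
    "elec_votes": [3, 10, 7],
})

# Join on the state name; validate raises if a state appears more than once
# on either side.
merged = predictions.merge(votes, on="state", how="left", validate="one_to_one")

# Sum electoral votes per predicted class.
totals = merged.groupby("party_class")["elec_votes"].sum()
republican_votes = int(totals.get(0, 0))
democrat_votes = int(totals.get(1, 0))

print(f"Republicans: {republican_votes}, Democrats: {democrat_votes}")
```

Joining on a key also makes it easy to spot states that have a prediction but no electoral-vote entry, since they would show up as missing values after the merge.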

The model predicts that the Democrats beat the Republicans (291 to 244), which matches the winner of the 2020 election. The predicted totals are also close to the actual result: Joe Biden received 306 electoral votes and Donald Trump received 232. Note that the predicted totals sum to 535 rather than 538 because only 50 rows are scored, so the comparison is approximate. This classification suggests that the population and racial makeup of each state carry useful signal for predicting election results.
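
Agreement with a single election is limited evidence of how well the model generalizes. One way to probe this, which the cells above do not do, would be k-fold cross-validation with the already imported `cross_val_score`. The sketch below runs it on synthetic stand-in data; `X_demo` and `y_demo` are invented and say nothing about the real model's accuracy.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in features: three population-like columns (invented).
X_demo = rng.integers(1_000, 1_000_000, size=(60, 3)).astype(float)
# Synthetic labels tied to the first column, so there is something to learn.
y_demo = (X_demo[:, 0] > np.median(X_demo[:, 0])).astype(int)

# 5-fold cross-validation: train on four folds, score accuracy on the fifth,
# and repeat so every fold is held out once.
scores = cross_val_score(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    X_demo,
    y_demo,
    cv=5,
)
print(f"mean accuracy over {len(scores)} folds: {scores.mean():.2f}")
```

On the real data, the same call with `X` and `y` would report accuracy on held-out rows. Because state dummies are among the features, folds grouped by year or by state would likely give a more honest estimate than random folds.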

Conclusion

Through this project, we gained a good understanding of how the Electoral College works and of the roles that population, race, and electoral votes play in an election. By exploring the data, we saw that a candidate can win the election without winning a majority of the popular vote. It is evident that the “winner takes all” system works in favor of many candidates, and both the regression analysis and the hypothesis test indicated that the number of electoral votes per state is related to the state's population. We also explored the states that have the most impact on the election, as well as the outliers. This gave us more insight and led to our next step: analyzing election data against each state's population grouped by race. Together, these analyses bring us back to the criticisms of the Electoral College raised in the introduction.